Abstract

Recently, five genetic variants significantly associated with thyroid cancer
(rs965513, rs944289, rs116909374, rs966423, and rs2439302) have been
discovered and validated in two independent GWAS and numerous case-control
studies conducted in different populations. We genotyped these five single
nucleotide polymorphisms (SNPs) in Han Chinese populations and performed
thyroid cancer risk prediction with nine machine learning methods. We found
that four SNPs were significantly associated with thyroid cancer in the Han
Chinese population, while no polymorphism was observed for rs116909374. Small
familial relative risks (1.02-1.05) and limited power to predict thyroid
cancer (AUCs: 0.54-0.60) indicate limited clinical potential. The four
significant SNPs have limited prediction ability for thyroid cancer.
Introduction

Thyroid cancer is the fifth most common cancer in women, and its incidence
is increasing. It is considered to have one of the highest familial risks
among all cancers [1, 2]. Most common diseases are caused by multiple genetic
loci rather than a few. In the last 2 years, two independent genome-wide
association studies (GWAS) have been conducted to identify single nucleotide
polymorphisms (SNPs) associated with thyroid cancer risk. Five SNPs
(rs965513, rs944289, rs116909374, rs966423, and rs2439302) that were highly
significantly associated with papillary thyroid carcinoma (PTC) were
discovered by these GWAS. In addition, these five SNPs were validated by
subsequent case-control studies in more than three different populations
(Han Chinese, Ohio, Poland, etc.; Table 1).

To examine the prediction ability of variants with highly significant
associations, we used all five SNPs to predict thyroid cancer with nine
classification methods (K-nearest neighbors, logistic regression, naive
Bayes, random forest, support vector machine, Bayesian additive regression
trees (BART), recursive partitioning, fuzzy rule-based system, and boosting).
Contrary to our intuition, we found that although these SNPs were
significantly associated with thyroid cancer, their predictive accuracy for
thyroid cancer was very low.

Methods 

The five SNPs were genotyped in 845 PTC cases and 1005 controls from the Han
Chinese population using the SNaPshot multiplex single-nucleotide extension
system. PTC patients who were treated in the Department of Head and Neck
Surgery, Fudan University Shanghai Cancer Center, Shanghai, China, from
January to December 2010 were enrolled in this study. All patients were
ethnically Han Chinese and came from Eastern China. A total of 1005
cancer-free unrelated individuals were recruited from the Taizhou
Longitudinal Study (TZL). Details of the SNPs (Table S1) and primers are
listed in our previous article [3].

Table 1. Odds ratios for the five SNPs from GWAS and case-control association studies in previous reports. Cells give OR (P-value).

Study  Population     Method        rs965513         rs944289         rs116909374      rs966423         rs2439302        Reference
1      Iceland        GWAS          1.73 (7.5e-13)   1.48 (8.6e-7)    -                -                -                [12]
       Iceland all    Combined      1.77 (6.8e-20)   1.44 (2.5e-8)    -                -                -
       USA            Case-control  1.81 (1.2e-7)    1.32 (1.2e-2)    -                -                -
       Spain          Case-control  1.54 (6.5e-3)    1.14 (4.3e-1)    -                -                -
       USA and Spain  Case-control  1.72 (3.7e-9)    1.26 (1.1e-2)    -                -                -
       All combined   Combined      1.75 (1.7e-27)   1.37 (2.0e-9)    -                -                -
2      Chernobyl      GWAS          1.76 (4.9e-9)    1.13 (0.17)      -                -                -                [13]
       -              Combined      1.65 (4.8e-12)   -                -                -                -
3      Japan          Case-control  1.69 (1.27e-4)   1.21 (0.0121)    -                -                -                [14]
4      UK             Case-control  1.98 (6.35e-34)  1.33 (6.95e-7)   -                -                -                [15]
5      Iceland        Case-control  1.70 (3.0e-18)   1.36 (4.2e-5)    2.03 (5.4e-7)    1.26 (3.8e-4)    1.41 (1.3e-6)    [16]
       Netherland     Case-control  -                1.39 (0.013)     1.95 (0.024)     1.80 (4.2e-6)    1.24 (0.088)
       USA            Case-control  -                1.51 (0.0067)    1.98 (0.018)     1.36 (3.5e-3)    1.33 (6.1e-3)
       Spain          Case-control  -                1.17 (0.31)      3.37 (2.6e-3)    1.20 (0.24)      1.34 (0.073)
       All combined   Case-control  -                1.36 (4.9e-8)    2.09 (4.6e-11)   1.34 (1.3e-9)    1.36 (2.0e-9)
6      USA            Case-control  2.10 (<2e-16)    1.28 (1.99e-3)   1.97 (1.11e-3)   1.35 (1.75e-4)   1.51 (4.24e-7)   [11]
       Poland         Case-control  1.78 (<2e-16)    1.21 (3.55e-3)   1.73 (6.27e-3)   1.15 (3.13e-2)   1.27 (2.20e-4)
7      China (a)      Case-control  1.53 (7.1e-4)    1.5 (2.8e-9)     -                1.3 (0.006)      1.4 (2.1e-4)     [3]
       China (b)      Case-control  1.53 (1.4e-4)    1.5 (2.0e-10)    -                1.3 (0.001)      1.4 (2.7e-5)

GWAS, genome-wide association studies; OR, odds ratio; -, no value listed.

(a) ORs were calculated based on the multiplicative model. For the combined study populations, the OR values were estimated using the Mantel-Haenszel model.

(b) ORs were calculated for the risk allele using multiple logistic regression analyses.

The relative risk to daughters of an individual affected by thyroid cancer
that is attributable to a given SNP is calculated by the formula

$\lambda^* = \frac{p(pr_2 + qr_1)^2 + q(pr_1 + q)^2}{[p^2 r_2 + 2pq r_1 + q^2]^2}$,

where $p$ is the frequency of the risk allele, $q = 1 - p$, and $r_1$ and
$r_2$ are the relative risks (estimated by odds ratios [ORs]) for
heterozygotes relative to common homozygotes and rare homozygotes relative
to common homozygotes in the population, respectively [4, 5]. Assuming a
multiplicative interaction, the proportion of the familial risk attributable
to the SNP is calculated by $\log(\lambda^*)/\log(\lambda_0)$, where
$\lambda_0$ is the overall familial relative risk (FRR), estimated to be
8.48 for thyroid cancer [1]. Gender- and age-matched sets of cases and
controls were constructed by resampling 1000 times.
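
As a minimal sketch of the calculation described above, the per-SNP familial relative risk and its share of the overall familial risk could be computed as follows. The function names and the example allele frequency and OR are illustrative assumptions, not values from this study; the homozygote risk is taken as the square of the heterozygote risk, which is what a multiplicative model implies.

```python
import math


def familial_relative_risk(p, r1, r2):
    """FRR to a relative of an affected individual attributable to one SNP.

    p  : frequency of the risk allele
    r1 : relative risk of heterozygotes vs. common homozygotes
    r2 : relative risk of risk-allele homozygotes vs. common homozygotes
    """
    q = 1.0 - p
    numerator = p * (p * r2 + q * r1) ** 2 + q * (p * r1 + q) ** 2
    denominator = (p ** 2 * r2 + 2.0 * p * q * r1 + q ** 2) ** 2
    return numerator / denominator


def proportion_of_familial_risk(frr_snp, frr_overall=8.48):
    """Share of the overall FRR explained by one SNP (multiplicative model)."""
    return math.log(frr_snp) / math.log(frr_overall)


# Hypothetical inputs for illustration only (not taken from the study).
p_example = 0.30          # assumed risk-allele frequency
or_het = 1.5              # assumed per-allele OR
or_hom = or_het ** 2      # homozygote OR under a multiplicative model

frr = familial_relative_risk(p_example, or_het, or_hom)
share = proportion_of_familial_risk(frr)
print(f"FRR = {frr:.4f}, share of familial risk = {100 * share:.2f}%")
```

Even with an assumed per-allele OR of 1.5, the resulting FRR stays close to 1, which illustrates how a sizeable OR can translate into a small familial relative risk.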

Table 2. Estimation of familial relative risk of thyroid cancer for the five SNPs in the Han Chinese population.

SNP           Familial relative risk    Proportion (%)        P-value
rs965513      1.0189 (1.0186-1.0192)    0.843 (0.806-0.880)   <2.2e-16
rs944289      1.0419 (1.0415-1.0422)    1.969 (1.922-2.016)   <2.2e-16
rs116909374   N.A. (a)                  N.A. (a)              N.A. (a)
rs966423      1.0493 (1.0485-1.0500)    2.191 (2.093-2.289)   <2.2e-16
rs2439302     1.0207 (1.0205-1.0210)    0.977 (0.939-1.015)   <2.2e-16

(a) The rs116909374 SNP was not detected in the Chinese population.

Nine machine learning methods were used to distinguish PTC patients from
healthy individuals: K-nearest neighbors [6], logistic regression, naive
Bayes [7], random forest [8], support vector machine [7], BART [9],
boosting, recursive partitioning, and fuzzy rule-based system [10]. The
parameters in the models were optimally selected. Classification accuracy,
sensitivity, specificity, and AUC were used to evaluate the performance of
the methods. They were calculated by 10-fold cross-validation.
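
The 10-fold cross-validation mentioned above can be sketched as a simple index split. This is only an illustration: the text does not say how the folds were drawn, so the shuffling, the seed, and the helper names `kfold_indices` and `cross_validation_splits` are assumptions.

```python
import random


def kfold_indices(n_samples, k=10, seed=0):
    """Split sample indices 0..n_samples-1 into k shuffled, disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Taking every k-th index keeps fold sizes within one of each other.
    return [indices[i::k] for i in range(k)]


def cross_validation_splits(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    folds = kfold_indices(n_samples, k, seed)
    for i, test_fold in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test_fold


# 845 cases + 1005 controls, the sample sizes given in the Methods.
splits = list(cross_validation_splits(845 + 1005, k=10))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Each model would be fitted on the training indices of a split and scored on the held-out fold, and the per-fold metrics would then be averaged.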

Results 

Marginal FRR of the significant SNPs 

Previous studies showed that the five SNPs, with large ORs, were
significantly associated with thyroid cancer in various populations
(Table 1). Our previous data also showed that the SNPs were significantly
associated with thyroid cancer in the Chinese population (the seventh study
in Table 1). In the present study, we estimated the FRR for the
significantly associated SNPs in the Chinese population. We found that the
FRRs were low, ranging from 1.02 to 1.05. Together, these SNPs accounted for
only 5.98% of the overall familial risk (Table 2), which is very close to
the value reported for the Polish population (about 6%) [11]. Our finding
suggests that the majority of the heritability remains undiscovered.
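
The 5.98% figure is the sum of the per-SNP proportions in Table 2, as the short check below illustrates. The second calculation, which multiplies the per-SNP FRRs and expresses the log of the product as a share of log(8.48), is an assumed equivalent route under the multiplicative model and is included only as a rough cross-check; the variable names are illustrative.

```python
import math

# Per-SNP values as read from Table 2 (rs116909374 was not polymorphic).
frr_by_snp = {
    "rs965513": 1.0189,
    "rs944289": 1.0419,
    "rs966423": 1.0493,
    "rs2439302": 1.0207,
}
proportion_pct_by_snp = {
    "rs965513": 0.843,
    "rs944289": 1.969,
    "rs966423": 2.191,
    "rs2439302": 0.977,
}

OVERALL_FRR = 8.48  # overall familial relative risk used in the Methods

# Route 1: add up the reported per-SNP proportions.
total_pct = sum(proportion_pct_by_snp.values())

# Route 2 (assumption): combine the FRRs multiplicatively, then take the
# log ratio against the overall FRR.
combined_frr = math.prod(frr_by_snp.values())
combined_pct = 100 * math.log(combined_frr) / math.log(OVERALL_FRR)

print(f"sum of proportions: {total_pct:.2f}%")
print(f"from combined FRR:  {combined_pct:.2f}%")
```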



Genetic risk prediction for thyroid cancer
based on five SNPs

The five significant SNPs were used to predict the risk of thyroid cancer
with nine classification methods. The results are summarized in Table 3.
The prediction accuracies ranged from 0.52 to 0.57 across the nine methods,
while the areas under the receiver operating characteristic curve (AUCs)
ranged from 0.54 to 0.60. The sensitivity of the prediction (0.28-0.48) was
much lower than the specificity (0.56-0.76), which suggests that the value
for clinical application might be limited (Table 3). In addition, the AUCs
of classification based on the five SNPs and gender, and on the five SNPs,
gender, and age, ranged from 0.49 to 0.58 and from 0.50 to 0.59,
respectively. This indicates that including gender and age information did
not improve prediction (Tables S2 and S3, Fig. S1).
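
For readers who want to reproduce the four measures reported here, the sketch below computes sensitivity, specificity, accuracy, and AUC from true labels and predicted scores. The 0.5 decision threshold, the function names, and the toy labels and scores are assumptions for illustration; the AUC is computed as the fraction of case-control pairs in which the case receives the higher score, with ties counted as one half.

```python
def confusion_metrics(labels, scores, threshold=0.5):
    """Sensitivity, specificity, and accuracy at a fixed score threshold."""
    tp = fp = tn = fn = 0
    for y, s in zip(labels, scores):
        predicted = 1 if s >= threshold else 0
        if y == 1 and predicted == 1:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted == 1:
            fp += 1
        else:
            tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one case and one control")
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }


def auc(labels, scores):
    """AUC as the probability that a case outscores a control."""
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for sc in cases:
        for sn in controls:
            if sc > sn:
                wins += 1.0
            elif sc == sn:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# Toy data for illustration only: 1 = PTC case, 0 = control.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2]
toy_metrics = confusion_metrics(toy_labels, toy_scores)
toy_auc = auc(toy_labels, toy_scores)
print(toy_metrics, toy_auc)
```

An AUC of 0.5 corresponds to random ranking of cases and controls, which is why values of 0.54 to 0.60 indicate weak discrimination.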

Conclusion 

In the present study, we estimated the FRR and evaluated the thyroid cancer
prediction accuracy of the five SNPs that showed significant association
with thyroid cancer in several association studies. The results showed that
although the OR of each SNP was large, the FRR of each SNP was marginal. By
10-fold cross-validation, we found that the prediction accuracy of the five
SNPs was low across all nine classification methods. In particular, the
sensitivity was very low, which suggests that the clinical application of
the five SNPs might be limited. Our results support the view that complex
diseases are caused by a large number of SNPs, environmental factors, and
their interactions. GWAS addressing common variants have reached their
limit, and the missing heritability for most complex disorders is very
high. Only about 5-10% of heritability has been found under the common
disease common variant (CDCV) model. To improve the prediction of complex
diseases from genetic variation, we need to incorporate more common and
rare SNPs, copy number variations (CNVs), and nongenetic susceptibility
factors, such as iodine intake and exposure to radiation, in the
classification analysis. Novel statistical methods for variable screening
should be developed to optimally select SNPs and CNVs across the genome for
disease risk prediction.

Table 3. Model performance with methods based on five significant SNPs.

Method                               AUC     Sensitivity  Specificity  Accuracy  Range of 95% CI of AUC
K-nearest neighbors                  0.5589  0.3861       0.6591       0.533     [0.4293, 0.7101]
Logistic regression                  0.6044  0.4982       0.5648       0.5346    [0.4433, 0.7368]
Naive Bayes                          0.5996  0.3921       0.7206       0.5686    [0.4571, 0.7469]
Random forest                        0.5743  0.3169       0.7558       0.5535    [0.4405, 0.7233]
Support vector machine               0.5494  0.2762       0.7775       0.547     [0.4187, 0.7086]
Bayesian additive regression trees   0.5906  0.4779       0.5571       0.5211    [0.4385, 0.7211]
Boosting                             0.6024  0.4723       0.5544       0.5157    [0.4584, 0.7287]
Recursive partitioning               0.5871  0.4085       0.7218       0.5778    [0.3926, 0.7048]
Fuzzy rule-based system              0.5396  0.4931       0.5006       0.4968    [0.4115, 0.6710]

AUC, sensitivity, specificity, and accuracy are mean values over the 10-fold cross-validation. Range of 95% CI of AUC represents the range of the 95%
CI of AUC in the 10-fold cross-validation. SVM represents support vector machines and kernel methods.
Supporting Information

Additional Supporting Information may be found in the
online version of this article:

Figure S1. ROC comparison among all the machine learning prediction
methods. Nine machine learning methods were used to distinguish PTC
patients from healthy individuals: K-nearest neighbors (KNN), logistic
regression (LR), naive Bayes, random forest, support vector machine,
Bayesian additive regression trees (BART), boosting, recursive
partitioning, and fuzzy rule-based system. The parameters in the models
were optimally selected. Classification accuracy, sensitivity, specificity,
and AUC were used to evaluate the performance of the methods. They were
calculated by 10-fold cross-validation.

Table S1. Genomic information for five acknowledged SNPs from GWAS.

Table S2. Model performance with methods based on five SNPs and gender.

Table S3. Model performance with methods based on five SNPs, gender, and age.